Abstract

This work compares SVM kernel methods in text categorization tasks. In
particular, I define a kernel function that estimates the similarity
between two objects from their compressed lengths. Compression algorithms
can detect arbitrarily long dependencies within text strings, whereas text
vectorization loses information during feature extraction and is highly
sensitive to the language of the text. Compression-based methods, by
contrast, are language independent and require no text preprocessing.
Moreover, the accuracy measured on the datasets (Web-KB, 20ng and
Reuters-21578) is, in some cases, greater than that of the Gaussian, linear
and polynomial kernels. The limits of the method are the computational time
complexity of the Gram matrix and very poor performance on non-textual
datasets.



1 Introduction 

In the world of discrete sequences (or sequential data), the learning
problem is an important challenge in pattern recognition and machine
learning. Classification tasks that involve symbolic data are very
frequent. For instance, text categorization tasks, e.g. news, web page and
document classification, are widely employed. In such tasks, classification
algorithms like support vector machines, neural networks and many others
require the conversion of these symbolic sequences into feature vectors
[7]. This preprocessing typically loses information. For instance, the
stemming phase maps words like showing, shows and shown to the same
representative (suffix-free) feature word show. Furthermore, this stage is
very language dependent, i.e. an English text stemmer is very different
from a Spanish or a Russian one. Finally, other preprocessing procedures
remove stop words and short words.
I employed a novel framework based on a different perspective. Textual
data are treated as symbol sequences, and by mining the structure of these
sequences it is possible to define a similarity measure between sequence
pairs. This amounts to defining a kernel function over the feature space.
Thus a learning phase is needed to capture features from the given
sequences. Then a similarity measure is required to quantify the features
shared by a pair of sequences. Finally, the kernel trick allows
classification algorithms like SVMs to be applied. The aim of this work is
to compare the results obtained from classical kernels with those of a
compression-based similarity kernel built on the Normalized Compression
Distance (NCD) [U El IU [5]. Both methods were applied to the Web-KB [18],
two instances of Reuters-21578 [H] and the 20 Newsgroups [15] datasets.

Section 2 covers the statistical foundations of Variable Order Markov
Models (VOMMs), the definition of the $K_{NCD}$ kernel and the multiclass
classification problem. Section 3 presents the vectorization method used
in the preprocessing phase of the datasets and reports the results
obtained for both methods on all datasets. Finally, a brief discussion of
symbolic learning is given in Section 4.

2 Methods 

From now on, for the rest of the document, I assume that
$S_m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, where each
$x_i \in \mathbb{R}^p$ and $y_i \in \{1, \ldots, M\}$. $M$ is the number of
classes and $p$ is the dimensionality of the training vectors. As
introduced in the previous section, text classification requires special
attention to how the data and its features are represented. In addition,
text categorization requires an ad hoc implementation for each natural
language to which it is applied. A general technique able to measure the
similarity between texts in the same language saves a lot of
implementation time. I start by presenting the multiclass extension of the
Support Vector Machine algorithm; then I introduce the Variable Order
Markov Models (VOMMs) [8], the underlying component of widely used
compression algorithms like Prediction by Partial Matching (PPM) [9],
Context-Tree Weighting (CTW) [10], the Lempel-Ziv-Markov chain algorithm
(LZMA) [13] and the Probabilistic Suffix Tree (PST) [12]. Finally, a
similarity measure and the kernel based on it are presented.

2.1 Support Vector Machine 

The Support Vector Machine algorithm is a binary linear classifier that
produces a separating hyperplane (whenever the training set is linearly
separable) that partitions the training space into two classes [6]. The
hyperplane equation gives the decision function for all unseen data
points. On the one hand, text categorization usually requires more than
two classes. On the other hand, linear separation is a rare condition in
real-world problems. For this reason I first introduce an SVM algorithm
able to control the non-separability of a training set with special
(slack) variables. This SVM extension is called the soft margin
classifier. Furthermore, another important technique allows complex
datasets to become linearly separable. Kernel functions, in fact, map the
original dataset into a higher dimensional space where linear separation
may be possible. The combination of SVMs and kernel functions is a very
powerful technique for tackling very complex classification problems.

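First, I present the quadratic optimization problem in dual form without
slack variables:

\[
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j
    \langle x_i, x_j \rangle \\
\text{subject to}\quad & \alpha_i \ge 0, \quad i = 1, \ldots, m \\
& \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}
\tag{1}
\]

where the $\alpha_i$ are the Lagrange multipliers inherited from the
conversion of the primal problem into the dual one, the binary labels are
$y_i \in \{-1, +1\}$, and $\langle x, y \rangle$ is the inner product
(within the inner product space) between $x$ and $y$. Once the
optimization is done, the set of $\alpha_i$ allows a new data point $x$ to
be classified with the decision function defined as

\[
f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i
  \langle x, x_i \rangle + b \right)
\tag{2}
\]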

Let $K(\cdot, \cdot)$ be a positive semi-definite kernel function. Thanks
to Mercer's theorem, $K$ can be expressed as a dot product in a higher
dimensional space, i.e. $K(x, y) = \langle \phi(x), \phi(y) \rangle$. With
the kernel trick, an SVM, for instance, can be combined with the kernel
function to obtain a linear classification in a higher dimensional space,
defined implicitly by the kernel function. The original dual problem (1)
can be rewritten as



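\[
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{subject to}\quad & \alpha_i \ge 0, \quad i = 1, \ldots, m \\
& \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}
\tag{3}
\]

and the decision function (2) becomes

\[
f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x, x_i) + b \right)
\tag{4}
\]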
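
The kernelized decision function (4) can be illustrated with a minimal
Python sketch. The names `gaussian_kernel` and `decide` are hypothetical,
and the multipliers, labels and bias below are hand-picked for
illustration; they are not the output of an actual optimization.

```python
import math


def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2) on equal-length tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decide(x, train_points, alphas, labels, b, kernel):
    """Evaluate sign(sum_i alpha_i * y_i * K(x, x_i) + b).

    A score of exactly zero is mapped to +1 (an arbitrary convention).
    """
    score = sum(a * y * kernel(x, xi)
                for a, y, xi in zip(alphas, labels, train_points)) + b
    return 1 if score >= 0 else -1


# Hypothetical values chosen by hand; they satisfy sum_i alpha_i * y_i = 0.
points = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]

print(decide((1.9, 2.1), points, alphas, labels, 0.0, gaussian_kernel))
print(decide((0.1, 0.0), points, alphas, labels, 0.0, gaussian_kernel))
```

Only the kernel values $K(x, x_i)$ enter the computation, which is why any
similarity function, including one defined on strings, can be plugged in.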



When the data points are not linearly separable even in the kernel feature
space, an extension of the previous problem is needed. In this case, the
slack variables $\xi_i \ge 0$ relax the constraints of the primal form to
$y_i(\langle x_i, w \rangle + b) \ge 1 - \xi_i$, with $i = 1, \ldots, m$.
Thus, problem (3) becomes:



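\[
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{subject to}\quad & \alpha_i \ge 0, \quad \forall i = 1, \ldots, m \\
& \sum_{i=1}^{m} \alpha_i y_i = 0 \\
& \alpha_i \le C, \quad \forall i = 1, \ldots, m
\end{aligned}
\tag{5}
\]

Only the last constraint, which bounds the Lagrange multipliers from above
by the penalty parameter $C$, distinguishes problem (5) from problem (3).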

2.1.1 Multiclass SVM 

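The natural extension of the binary SVM classification problem to
multiclass classification can be represented by the following
optimization problem in primal form

\[
\begin{aligned}
\text{minimize}\quad & \frac{1}{2} \sum_{k=1}^{M} \|w_k\|^2
  + C \sum_{i=1}^{m} \sum_{k \ne y_i} \xi_i^k \\
\text{subject to}\quad & \langle x_i, w_{y_i} \rangle + b_{y_i} \ge
  \langle x_i, w_k \rangle + b_k + 2 - \xi_i^k \\
& \xi_i^k \ge 0, \quad i = 1, \ldots, m
\end{aligned}
\]

where $k \in \{1, \ldots, M\} \setminus \{y_i\}$ and
$y_i \in \{1, \ldots, M\}$ is the multiclass label of the pattern $x_i$.
Computational issues suggest different multiclass classification
strategies based on a combination of several binary classifiers.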

One-vs-One. In the first strategy, $M(M-1)/2$ binary classifiers are
trained, one for each pair of classes. The final decision function
evaluates the decision function of every classifier and assigns to the
object the class that obtains the highest number of votes. In this
strategy, the number of binary classifiers is quadratic in $M$, but the
computational time required to train a single classifier is limited
because only the examples belonging to the two classes involved are
evaluated.
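
The voting scheme described above can be sketched as follows. This is a
minimal illustration: `one_vs_one_predict`, the toy one-dimensional class
centers and the nearest-center pairwise classifiers are hypothetical
stand-ins for trained binary SVMs, and ties are broken by first-seen
order, which is an assumption rather than the behavior of any particular
library.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Return the class with most votes among all pairwise classifiers.

    `pairwise` maps each pair (a, b) with a < b to a function x -> a or b.
    """
    votes = Counter(pairwise[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical toy problem: each class is represented by a 1-D center.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
classes = sorted(centers)


def make_pairwise(a, b):
    """Stand-in for a trained binary classifier separating classes a and b."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


pairwise = {(a, b): make_pairwise(a, b) for a, b in combinations(classes, 2)}

print(len(pairwise))                              # M(M-1)/2 classifiers
print(one_vs_one_predict(4.2, classes, pairwise))
```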

One-vs-the-Rest. Alternatively, it is possible to train $M$ binary
classifiers, one for every class. In this case each classifier is trained
to discriminate one class from all the other classes, and the decision
function is defined as

\[
f(x) = \arg\max_{k = 1, \ldots, M} \left( \sum_{i=1}^{m} \alpha_i^k y_i^k
  K(x, x_i) + b^k \right)
\]

where $\alpha_i^k$, $y_i^k$ and $b^k$ are the multipliers, the binary
labels and the bias of the $k$-th binary classifier. In this strategy, the
number of binary classifiers is linear in $M$, but the computational time
required to train a single classifier is higher than in the previous
method, since the data points evaluated are the entire training set
$S_m$. Moreover, in each training stage the binary classifier is usually
trained on many more negative than positive examples.
Although in general no significant accuracy differences exist among the
three methods, one-vs-one is used in many multiclass SVM implementations
such as libsvm.
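
A minimal sketch of the one-vs-the-rest strategy follows. The helper names
`one_vs_rest_labels` and `one_vs_rest_predict` are hypothetical, and the
per-class scoring functions are toy stand-ins for the real-valued outputs
of trained binary SVMs. The relabeling step also shows why each binary
problem tends to have many more negative than positive examples.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multiclass problem as +1 (positive_class) versus -1 (rest)."""
    return [1 if y == positive_class else -1 for y in labels]


def one_vs_rest_predict(x, scorers):
    """Return the class whose binary classifier gives the largest score.

    `scorers` maps each class label to a real-valued decision function,
    i.e. the argument of sign(.) in a binary SVM.
    """
    return max(scorers, key=lambda k: scorers[k](x))


# Hypothetical toy data: six training labels and 1-D class centers.
train_labels = [0, 0, 1, 2, 2, 2]
centers = {0: 0.0, 1: 5.0, 2: 10.0}
scorers = {k: (lambda x, c=c: -abs(x - c)) for k, c in centers.items()}

binary = one_vs_rest_labels(train_labels, positive_class=1)
print(binary.count(1), binary.count(-1))  # positives versus negatives
print(one_vs_rest_predict(4.2, scorers))
```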

2.2 Variable Order Markov Models 

Text vectorization loses information during feature extraction and is
highly sensitive to the language of the text. In order to overcome these
limitations, it is advantageous to employ sequential learning techniques
that extract similarities directly from the learnt structure of the text.

Sequential data learning usually involves quite simple methods, like
Hidden Markov Models (HMMs), that are able to model complex symbolic
sequences by assuming hidden states that control the system dynamics.
However, HMM training suffers from local optima, and the accuracy of HMMs
has been surpassed by VOMMs. Other techniques like $N$-gram models (or
order-$N$ Markov chains) compute the frequency of each subsequence of
length $N$. In this case the number of possible model states grows
exponentially with $N$, and both space and time issues arise.

In this perspective, the textual training sequence is generated by an
unknown stationary symbol source $S = (\Sigma, P)$, where $\Sigma$ is the
symbol alphabet and $P$ is the probability distribution of the symbols.
Given the maximum order $D$ of conditional dependencies and a training
sequence $s$ generated by $S$, a VOMM returns a model for the source $S$,
that is, an estimate $\hat{P}$ of the probability distribution $P$.
Applying VOMMs instead of $N$-gram models has several advantages. A VOMM
estimation algorithm builds a model for $S$ efficiently. In fact, only the
$D$-grams that actually occur are stored, and their conditional
probabilities $\hat{P}(\sigma \mid s)$, with $\sigma \in \Sigma$ and
$s \in \Sigma^{d \le D}$, are estimated. This trick saves a lot of memory
and computational time and makes it feasible to model sequences with very
long dependencies ($D \in [1, 10^3]$) on personal computers with 4 GB of
memory.
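
The idea of storing only the contexts that actually occur can be sketched
as follows. This is a simplified illustration, not one of the cited
algorithms: real VOMM estimators such as PPM, CTW or PST add smoothing,
escape or weighting mechanisms, whereas the hypothetical helpers below use
plain relative frequencies.

```python
from collections import Counter, defaultdict


def observed_context_counts(seq, max_order):
    """Count next-symbol occurrences for every context of length
    0..max_order that actually occurs in seq."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(seq):
        for d in range(min(max_order, i) + 1):
            counts[seq[i - d:i]][sym] += 1
    return counts


def conditional_probability(counts, context, sym):
    """Relative-frequency estimate of P(sym | context).

    Returns None when the context was never observed.
    """
    followers = counts.get(context)
    if not followers:
        return None
    return followers[sym] / sum(followers.values())


counts = observed_context_counts("abracadabra", max_order=2)
stored = sum(1 for ctx in counts if len(ctx) == 2)
possible = len(set("abracadabra")) ** 2

print(stored, "order-2 contexts stored out of", possible, "possible")
print(conditional_probability(counts, "a", "b"))
```

The gap between stored and possible contexts widens quickly as the
maximum order grows, which is what makes large values of $D$ tractable.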

2.3 Lossless Compression Algorithms 

Lossless Compression Algorithms (LCAs) build a prefix tree to estimate the
symbol probability distribution $P$ by combining, through the chain rule,
the conditional probabilities of each symbol given the $d$ previous
symbols (usually $d \le D$). In other words, in a first stage LCAs produce
a VOMM by means of some estimation algorithm. In a second stage, LCAs
actually compress the sequence by applying an encoding scheme like
Arithmetic Encoding (AE). Starting from the estimated conditional
probabilities $\hat{P}(\sigma \mid s)$, with $\sigma \in \Sigma$ and
$s \in \Sigma^{d \le D}$, AE assigns to the original sequence a real
number within the interval $[0, 1)$ [8]. Let $C(\cdot)$ be the function
that computes the length of the compressed sequence through some freely
available compressor like bzip2, ppmc or lzma. It is possible to prove
that, when the average log-loss is used to estimate prediction accuracy,
prediction accuracy and compression ratio are equivalent [9]. Thus better
predictions mean better compression: sequences that are easy to compress
are sequences that are easy to learn and predict.
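
The link between log-loss and compressed length can be made concrete with
a small sketch. Assuming an adaptive fixed-order model with add-one
smoothing (a deliberately simple stand-in for a real VOMM, with the
hypothetical name `adaptive_code_length_bits`), the sum of
$-\log_2 \hat{P}(\sigma \mid s)$ over the sequence is the ideal code
length that an arithmetic coder driven by that model would approach.

```python
import math
from collections import defaultdict


def adaptive_code_length_bits(seq, order):
    """Total log-loss in bits of an adaptive order-`order` model.

    Probabilities use add-one smoothing over the symbols of `seq`; the
    alphabet is assumed to be known to both encoder and decoder.
    """
    alphabet_size = len(set(seq))
    symbol_counts = defaultdict(int)   # (context, symbol) -> occurrences
    context_counts = defaultdict(int)  # context -> occurrences
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]
        p = (symbol_counts[(ctx, sym)] + 1) / (context_counts[ctx] + alphabet_size)
        bits -= math.log2(p)
        symbol_counts[(ctx, sym)] += 1
        context_counts[ctx] += 1
    return bits


text = "ab" * 200
bits_order0 = adaptive_code_length_bits(text, order=0)
bits_order1 = adaptive_code_length_bits(text, order=1)

print(round(bits_order0 / len(text), 3), "bits per symbol with order 0")
print(round(bits_order1 / len(text), 3), "bits per symbol with order 1")
```

On this alternating sequence the order-1 model predicts almost every
symbol correctly and therefore needs far fewer bits than the order-0
model, which illustrates that a better predictor is a better compressor.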



2.4 Similarity Measure 

The function $C$ leads to the definition of a similarity measure. Once
the patterns of a sequence have been detected, it is possible to measure
how many of them are shared with another sequence. With this aim,
Cilibrasi et al. [4] [5] define a similarity measure that quantifies how
easily a sequence $x$ can be compressed given the compression scheme of a
sequence $y$. The Normalized Compression Distance (NCD) is defined as
follows:

\[
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}
\]

where $xy$ represents the concatenation of sequence $x$ with sequence $y$
and $C(x)$ is a function that returns the length of the compressed
version of $x$. The range of $\mathrm{NCD}(x, y)$ is $[0, 1]$, although
real compressors may yield values slightly above 1.
$\mathrm{NCD}(x, y) = 0$ indicates that $x$ and $y$ are identical, whereas
$\mathrm{NCD}(x, y) = 1$ indicates that the two objects are very
dissimilar. The NCD function cannot work directly as a kernel function. In
fact, the NCD function is not symmetric, since real compressors generally
give $C(xy) \ne C(yx)$. Symmetry is obtained by defining the kernel
function $K_{NCD}(x, y)$ [3] as

\[
K_{NCD}(x, y) = 1 - \frac{\mathrm{NCD}(x, y) + \mathrm{NCD}(y, x)}{2}
\tag{7}
\]

However, as for many string kernels, positive semi-definiteness cannot be
proved.
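
The distance and the kernel can be sketched in a few lines of Python. This
is an illustration under stated assumptions: the compressor is bzip2 via
the standard-library `bz2` module, which is not necessarily the compressor
or the settings used in the experiments, and the function names are
hypothetical.

```python
import bz2
import random


def compressed_length(data: bytes) -> int:
    """C(x): length in bytes of the bzip2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between x and y."""
    cx, cy = compressed_length(x), compressed_length(y)
    cxy = compressed_length(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def k_ncd(x: bytes, y: bytes) -> float:
    """Symmetrized similarity 1 - (NCD(x, y) + NCD(y, x)) / 2."""
    return 1.0 - (ncd(x, y) + ncd(y, x)) / 2.0


# Toy inputs: a repetitive English text and pseudo-random bytes.
text = b"support vector machines classify text documents. " * 40
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(2000))

print(k_ncd(text, text))   # similarity of a text with itself
print(k_ncd(text, noise))  # similarity of a text with unrelated data
```

Averaging the two directions makes the function symmetric by
construction, whatever the compressor does with the order of the
concatenation.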



3 Experiments 

I used four different datasets to test the accuracy and robustness of the
proposed method in comparison with other standard kernels. These datasets
were collated by Ana Cardoso-Cachopo [18], who split each dataset
randomly, taking two thirds of the documents for training and the
remaining third for testing. The first dataset is the World Wide
Knowledge Base (Web-KB), a collection of web pages gathered and manually
classified by the text learning group of Carnegie Mellon University. The
second and third datasets are obtained from the Reuters-21578 dataset, a
collection of classified Reuters news, restricted to eight classes (R8)
or fifty-two classes (R52). Finally, the last dataset is a collection of
approximately 20000 newsgroup documents collected by Ken Lang. Table 1
reports the number of classes, number of training documents, number of
testing documents and number of features for each dataset. For further
details, consult the web page [18].
Every dataset has been processed in four stages:

1. Terms are extracted from the document. All letters are converted to
lowercase, and tabs, multiple spaces and non-visible characters are
trimmed.

2. Terms shorter than 3 characters are removed.

3. Stop words are removed.

4. A stemming procedure is applied.
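
The four stages above can be sketched as a small pipeline. This is only an
illustration: the stop-word list is a tiny hypothetical sample, and
`naive_stem` is a crude suffix stripper standing in for a real stemming
algorithm, so irregular forms such as shown are left untouched.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"the", "and", "are", "for", "with", "that", "this"})


def naive_stem(term):
    """Strip one common English suffix; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term


def preprocess(text):
    """Return the term lists produced by each of the four stages."""
    # Stage 1: drop non-visible characters, lowercase, split on whitespace
    # (tabs and repeated spaces collapse because split() ignores them).
    visible = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    stage1 = visible.lower().split()
    # Stage 2: remove terms shorter than 3 characters.
    stage2 = [t for t in stage1 if len(t) >= 3]
    # Stage 3: remove stop words.
    stage3 = [t for t in stage2 if t not in STOP_WORDS]
    # Stage 4: apply the (naive) stemmer.
    stage4 = [naive_stem(t) for t in stage3]
    return stage1, stage2, stage3, stage4


stages = preprocess("The  cat\tis showing   shows shown")
for number, terms in enumerate(stages, start=1):
    print(number, terms)
```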

For the experiments with classical kernels, I used the stemmed datasets
produced by the 4th stage in order to reduce the number of features as
much as possible. The final vectorized training/testing sets contain the
count of each term that appears. Before the training/testing stage, the
feature vectors are scaled into $[-1, 1]$, so that features with large
numeric ranges do not dominate the others. All experimental stages are
performed in the Python programming language, and I investigated the
accuracy of the proposed kernel with the scikits.learn Python package,
which is bound to the libsvm SVM implementation. The results in Table 2
are the best accuracies on the test datasets after a cross-validation
procedure for choosing the best model. To employ the $K_{NCD}$ kernel, I
used the first-stage datasets to compute the Gram matrix
$G = [K_{NCD}(x_i, x_j)]$, with $i, j = 1, \ldots, m$. Although the Gram
matrix computation takes a lot of computational time, the complearn-tools
package, included in Debian-based Linux distributions (like Ubuntu),
requires between 5 and 20 minutes to compute the matrix thanks to its
efficient multicore implementation [17]. The experiments ran on a Dell
Precision workstation with 24 GB of RAM and dual quad-core Xeon X5677
processors at 3.46 GHz.
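
The Gram matrix step can be sketched as follows. This is not the
complearn-tools implementation: `gram_matrix` is a hypothetical
single-threaded helper that exploits the symmetry of the kernel, and
`shared_symbol_kernel` is a toy placeholder for $K_{NCD}$. The function
`fit_precomputed_svm` assumes a current scikit-learn installation, where
`SVC(kernel="precomputed")` accepts a Gram matrix in place of feature
vectors; it is defined here but not called.

```python
def gram_matrix(docs, kernel):
    """Build the m x m matrix G[i][j] = kernel(docs[i], docs[j]).

    Only the upper triangle is evaluated, i.e. m(m+1)/2 kernel calls,
    which assumes that the kernel is symmetric.
    """
    m = len(docs)
    gram = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            gram[i][j] = gram[j][i] = kernel(docs[i], docs[j])
    return gram


def fit_precomputed_svm(gram_train, labels, C=1.0):
    """Train an SVM on a precomputed Gram matrix (requires scikit-learn).

    At prediction time the classifier expects a matrix with one row per
    test document and one column per training document.
    """
    from sklearn.svm import SVC
    classifier = SVC(kernel="precomputed", C=C)
    classifier.fit(gram_train, labels)
    return classifier


def shared_symbol_kernel(x, y):
    """Toy symmetric similarity: Jaccard index of the two symbol sets."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b)


gram = gram_matrix(["abc", "abd", "xyz"], shared_symbol_kernel)
for row in gram:
    print(row)
```

The quadratic number of kernel evaluations, each of which requires
several compressions, is what dominates the running time for large
training sets.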

Furthermore, the $K_{NCD}$ SVM was employed in another practical
classification task, handwritten digit recognition. In this case I used
the MNIST dataset of digits 0-9. Detailed results of this experiment are
not reported because the performance was very poor: the overall accuracy
never exceeds 54.2%. This failure is discussed in Section 4.

Dataset   #Classes   #Train Docs   #Test Docs   #Features
Web-KB           4          2803         1396        7770
R8               8          5485         2189       17387
R52             52          6532         2568       19241
20ng            20         11293         7528       70216

Table 1: Dataset characteristics. The column values represent respectively
the number of classes, the dimension of training set, the dimension of
testing set and the number of features.

Dataset   $K_{NCD}$   Linear   Polynomial   Gaussian
Web-KB    94.38%      85.82%   94.11%       50.87%
R8        94.33%      96.98%   94.42%       49.67%
R52       89.48%      92.39%   90.00%       49.82%
20ng      87.71%      84.26%   86.81%       48.27%

Table 2: Accuracy of the compared kernels on the four testing sets.

The accuracy of the $K_{NCD}$ SVM kernel is higher, in some cases, than
that achieved by classical SVM kernels such as the Gaussian, polynomial
and linear ones. The results are shown in Table 2. They were obtained
after $K$-fold cross-validation (with $K = 5$) to fit the best kernel
parameters, which are reported in Table 3. For the $K_{NCD}$ and linear
kernels, $C$ is the only parameter that can influence the accuracy,
whereas the polynomial and Gaussian kernels have other important
parameters such as $d$, $\gamma$ and $r$. The model selection procedure
computes the mean accuracy of a model with five-fold cross-validation and
then stores the result. Once the procedure has tested every admissible
value of each parameter, the parameter combination with the highest
accuracy is returned.
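
The model selection procedure can be sketched as an exhaustive grid
search. The sketch makes assumptions that the text does not state: folds
are contiguous and unshuffled, and the caller supplies a hypothetical
`score(params, train_idx, val_idx)` function that trains a model on the
training indices and returns its accuracy on the validation indices.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, unshuffled folds."""
    folds = [list(range(f * n_samples // k, (f + 1) * n_samples // k))
             for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def grid_search(param_grid, n_samples, score, k=5):
    """Return the parameter combination with the highest mean accuracy."""
    names = sorted(param_grid)
    best_params, best_accuracy = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        accuracy = mean(score(params, train, val)
                        for train, val in k_fold_indices(n_samples, k))
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy


# Hypothetical scoring function that peaks at C = 3, for demonstration.
def toy_score(params, train_idx, val_idx):
    return 1.0 - abs(params["C"] - 3) / 10.0


print(grid_search({"C": [1, 2, 3, 4]}, n_samples=10, score=toy_score))
```

For kernels with several parameters the grid is the Cartesian product of
the admissible values, so the number of cross-validation runs grows
multiplicatively with each additional parameter.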

Dataset   $K_{NCD}$   Linear     Polynomial                      Gaussian
Web-KB    $C=4$       $C=0.07$   $C=0.1, d=6, \gamma=0, r=2$     $C=7, \gamma=9, d=4$
R8        $C=3$       $C=1.5$    $C=0.1, d=7, \gamma=0.1, r=2$   $C=0.8, \gamma=3, d=2$
R52       $C=1$       $C=2.8$    $C=0.1, d=7, \gamma=0.1, r=2$   $C=1.4, \gamma=2, d=2$
20ng      $C=11$      $C=0.01$   $C=2.3, d=6, \gamma=0.1, r=2$   $C=5, \gamma=0, d=1$

Table 3: Chosen SVM and kernel parameters after $K$-fold cross-validation
with $K = 5$ over the training sets. The admissible values are
respectively $\{0.01, 0.02, \ldots, 0.1, 0.2, \ldots, 3, 4, \ldots, 20\}$
for $C$, $\{1, 2, \ldots, 19\}$ for $d$, $\{0, 0.1, \ldots, 1\}$ for
$\gamma$ and $\{0, 1, \ldots, 6\}$ for $r$.

4 Discussion

The Kolmogorov complexity $K(x)$ of an object $x$ expressed as a string
(or symbol sequence) is the length of the shortest program, for a
universal Turing machine, that outputs the string $x$. In other words,
the Kolmogorov complexity measures the amount of knowledge needed to
compute a given object, that is, its semantic content. The Kolmogorov
complexity is uncomputable, and this can be proved by reduction from the
undecidability of the Halting Problem. The first important inequality is:

\[
K(x) \le |x| + c, \quad \forall x
\]

where $c$ is a constant and $|x|$ is the length of $x$.

Some information content is syntactically accessible, some is not. For
instance, no syntactic information can be extracted from the digits of
the constant $\pi$. In fact, the known digits of $\pi$ (like those of many
other natural constants) pass the standard randomness tests, and no
structure can be extracted from the digits alone. Nevertheless, it is
quite simple to write a short computer program that outputs the digits of
$\pi$. Thus, only semantic information allows the digits of $\pi$ to be
compressed. However, many symbolic sequences involved in real-world
problems can be compressed syntactically. Moreover, many symbolic patterns
are inaccessible to a human observer, who obviously cannot detect
recurrences that are millions of symbols long within a string.
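
The contrast between syntactic and semantic compressibility can be
illustrated with a short sketch. It assumes a well-known unbounded spigot
algorithm for generating the digits of $\pi$ and uses zlib as a stand-in
general-purpose compressor; neither choice comes from the experiments in
this work.

```python
import zlib
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time
    using an unbounded spigot algorithm on arbitrary-precision integers."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


digits = "".join(str(d) for d in islice(pi_digits(), 1000)).encode("ascii")
ratio = len(zlib.compress(digits, 9)) / len(digits)

print(digits[:10].decode("ascii"))
print(round(ratio, 3), "compressed size / original size")
```

A syntactic compressor can do little more than exploit the fact that only
ten of the 256 byte values occur, so the ratio stays close to the
$\log_2(10)/8 \approx 0.415$ limit for random decimal digits, while the
generating program itself is only a few lines long.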

Lossless compression algorithms allow the syntactic compression of an
object such as a binary string. The basic idea is that, given a fixed
object, a compression algorithm is able to rewrite the object so that the
rewritten version is shorter than the original. The reduction in length
demonstrates the capacity of the compression algorithm to describe the
object in terms of rules and patterns. Hence compression algorithms act
purely on a syntactic level. In this way, the compressor imposes an upper
bound on the Kolmogorov complexity. This upper bound is tighter than the
previous inequality since:

\[
K(x) \le C(x) \le |x| + O(1), \quad \forall x
\]

where $C(x)$ is the length of the compressed version of $x$. The idea that
compressors could approximate Kolmogorov complexity was first presented
in some works [4] [3] [5] that led to the definition of a similarity
metric called Normalized Compression Distance and of a kernel based on
it. Their results showed successful applications to unsupervised and
supervised tasks such as text categorization, and protein and music
clustering.

Learning from sequential data remains an open challenge. VOMMs obtain
good results in several classification tasks on symbolic data [8]. I
remark that the NCD function is a feature-free distance function, i.e.
the similarity estimate is not based on a fixed set of features. By
contrast, most other similarity measures are feature-based, i.e. they
require detailed knowledge of the problem area in order to measure the
similarity/dissimilarity between two objects.

The failure of the $K_{NCD}$ kernel on numerical datasets can be
understood through an example. Consider the sequences $s$ = "1.999999"
and $r$ = "2.000000": their meanings (the quantities) are very close,
while the symbolic sequences are very dissimilar, sharing no digits. The
same thing happens with synonyms in textual data: grow and arise, for
instance, are considered very dissimilar. However, in real-world
problems, situations like the latter are rare, while the former are very
common. In addition, for a given object, the number of potential
neighbors is an order of magnitude greater for numerical objects than for
textual objects.

5 Conclusions 

The accuracy of the proposed kernel exceeds that of the standard kernels
on some datasets (20ng and Web-KB). The $K_{NCD}$ kernel method cannot,
however, be applied to classification tasks in general. In fact,
compression-based methods fail on numerical datasets because numbers (and
their digits) embed an encoding, e.g. of integer quantities. Furthermore,
the computational time complexity is a feasibility problem for many
practical tasks. Nevertheless, the good results highlight the promise of
this framework. The $K_{NCD}$ kernel is independent of the document
language and could be used with East Asian ideographic languages.